Using Misclassification Analysis for Data Cleaning
Abstract
Data cleaning is a pre-processing technique used in most data mining problems. Its purpose is to remove noise, inconsistent data, and errors in order to obtain a cleaner, more representative data set from which to develop a reliable prediction model. In most prediction models, unclean data can reduce prediction accuracy. In this paper, we investigate classification problems and apply a misclassification analysis technique for data cleaning. To demonstrate the concept, we use an artificial neural network (ANN) as the core computational intelligence technique. We evaluate the proposed data cleaning technique on three benchmark data sets obtained from the University of California, Irvine (UCI) machine learning repository. All three are binary classification problems: German credit data, BUPA liver disorders, and Johns Hopkins Ionosphere. The experimental results show that the proposed cleaning technique could be a good alternative that provides some confidence when constructing a classification model.
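The abstract does not spell out the cleaning procedure, so the following is only a minimal sketch of one common form of misclassification-based filtering: predict each instance with a model that was trained without it, and flag the instances whose prediction disagrees with their label. Everything here is an assumption made for illustration: the function names, the k-fold scheme, the toy data, and above all the nearest-centroid classifier, which stands in for the paper's ANN only to keep the example dependency-free.

```python
import random


def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT the paper's ANN): one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x):
        # Return the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def flag_misclassified(X, y, fit=nearest_centroid_fit, k=5, seed=0):
    """Return sorted indices whose held-out prediction disagrees with the label."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    flagged = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        flagged.extend(i for i in fold if predict(X[i]) != y[i])
    return sorted(flagged)


def clean(X, y, **kwargs):
    """Drop the flagged instances; return cleaned X, cleaned y, and the flagged indices."""
    bad = set(flag_misclassified(X, y, **kwargs))
    keep = [i for i in range(len(X)) if i not in bad]
    return [X[i] for i in keep], [y[i] for i in keep], sorted(bad)


# Toy binary problem (hypothetical data): two well-separated clusters,
# plus one point at index 10 that sits in cluster 0 but is labelled 1.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4],
     [5.0, 5.0], [5.2, 5.1], [5.1, 5.3], [5.3, 5.2], [5.2, 5.4],
     [0.1, 0.1]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

X_clean, y_clean, removed = clean(X, y)
print(removed)
```

On this toy set only the deliberately mislabelled point is flagged. On real data such as the UCI sets named above, a filter like this may also discard correctly labelled but hard instances, so the cleaned set is best checked by retraining and comparing accuracy on untouched test data.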
Related resources
Data Cleaning for Classification Using Misclassification Analysis
In many classification problems, data cleaning is used as a preprocessing technique to achieve better results. The purpose of data cleaning is to remove noise, inconsistent data and errors in the training data. This should enable the use of a better and representative data set to develop a reliable classification model. In most classification models, unclean data could somet...
Adherence to osteoporosis pharmacotherapy is underestimated using days supply values in electronic pharmacy claims data.
PURPOSE Days supply (prescription duration) values are commonly used to estimate drug exposure and quantify adherence to therapy, yet accuracy is not routinely assessed, and potential inaccurate reporting has been previously identified. We examined the impact of cleaning days supply values on the measurement of adherence to oral bisphosphonates. METHODS We identified new users of oral bisphos...
Binary Regression With a Misclassified Response Variable in Diabetes Data
Objectives: The categorical data analysis is very important in statistics and medical sciences. When the binary response variable is misclassified, the results of fitting the model will be biased in estimating adjusted odds ratios. The present study aimed to use a method to detect and correct misclassification error in the response variable of Type 2 Diabetes Mellitus (T2DM), applying binary ...
An application of Measurement error evaluation using latent class analysis
Latent class analysis (LCA) is a method of evaluating non sampling errors, especially measurement error in categorical data. Biemer (2011) introduced four latent class modeling approaches: probability model parameterization, log linear model, modified path model, and graphical model using path diagrams. These models are interchangeable. Latent class probability models express l...
Analysis of Angina Pectoris Status Based on Misclassification Probabilities of the Smoking Risk Factor in the Tehran Lipid and Glucose Study, 1378-79
Misclassification of disease status and risk factors is one of the main sources of error in studies. Wrong assignment of individuals into exposed and non-exposed groups may seriously distort the results in case-control studies. This study investigates the effect of misclassification error on odds ratio estimates and attempts to introduce a correction method. Data on 3332 men aged 30-69 years fr...
Publication date: 2012